-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Fix wrong behavior of DDPStrategy
option with simple GAN training using DDP
#20936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Fix wrong behavior of DDPStrategy
option with simple GAN training using DDP
#20936
Conversation
for more information, see https://pre-commit.ci
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions. |
…dp-implementation test: cover MultiModelDDPStrategy
examples/pytorch/domain_templates/generative_adversarial_net.py
Outdated
Show resolved
Hide resolved
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
@SkafteNicki could you pls check too :) |
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
@samsara-ku please mark also the comments as resolved to keep track what has left :) |
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are getting close with this PR, just a few suggestions
from lightning.pytorch.strategies.ddp import MultiModelDDPStrategy | ||
|
||
|
||
def test_multi_model_ddp_setup_and_register_hooks(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add one line docstring what the test does
) | ||
|
||
|
||
def test_multi_model_ddp_register_hooks_cpu_noop(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add one line docstring what the test does
Co-authored-by: Nicki Skafte Detlefsen <[email protected]>
What does this PR do?
Fixes #20866
Fixes #20328
Fixes #18740
Fixes #17212
This PR adds
MultiModelDDPStrategy
class and its simple execution example, for the multi-gpu training with GAN training.Simply speaking:
Currently, pytorch lightning simple GAN training has has problem with
DistributedDataParallel
strategy. It tries to wrappl.trainer
, not thenn.Module
models in thepl.trainer
Although we can activate
find_unused_parameters=True
options to avoid this issue but it is not right way; I think it is just a trick.So the key idea to solve this issue is that we assign
DistributedDataParallel
to the each model in thepl.trainer
, different from the previous strategyDDPStrategy
.I already tested with my GPUs to visaulize the result and tracked the gradients of model with each epoch; it works and you can see the visulized result in thie google drive link
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20936.org.readthedocs.build/en/20936/